{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running Meta Llama 3.1 on Google Colab using Hugging Face transformers library\n",
    "This notebook goes over how you can set up and run Llama 3.1 using Hugging Face transformers library\n",
    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Steps at a glance:\n",
    "This demo showcases how to run the example with already converted Llama 3.1 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
    "\n",
    "To use already converted weights, start here:\n",
    "1. Request download of model weights from the Llama website\n",
    "2. Login to Hugging Face from your terminal using the same email address as (1). Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start). \n",
    "3. Run the example\n",
    "\n",
    "\n",
    "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n",
    "1. Request download of model weights from the Llama website\n",
    "2. Clone the llama repo and get the weights\n",
    "3. Convert the model weights\n",
    "4. Prepare the script\n",
    "5. Run the example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using already converted weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Request download of model weights from the Llama website\n",
    "Request download of model weights from the Llama website\n",
    "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
    "\n",
    "Fill  the required information, select the models “Meta Llama 3.1” and accept the terms & conditions. You will receive a URL in your email in a short time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Prepare the script\n",
    "\n",
    "We will install the Transformers library and Accelerate library for our demo.\n",
    "\n",
    "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n",
    "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install transformers\n",
    "!pip install accelerate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "import transformers\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3.1-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
    "\n",
    "1. Accept Terms of Service for Meta Llama 3.1 on Meta [website](https://llama.meta.com/llama-downloads).\n",
    "2. Use the same email address from Step (1) to login into Hugging Face.\n",
    "\n",
    "Follow the instructions on this Hugging Face page to login from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install --upgrade huggingface_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = transformers.pipeline(\n",
    "\"text-generation\",\n",
    "      model=model,\n",
    "      torch_dtype=torch.float16,\n",
    " device_map=\"auto\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Run the example\n",
    "\n",
    "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n",
    "\n",
    "Let’s also generate a text sequence based on the input that we provide. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sequences = pipeline(\n",
    "    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
    "    do_sample=True,\n",
    "    top_k=10,\n",
    "    num_return_sequences=1,\n",
    "    eos_token_id=tokenizer.eos_token_id,\n",
    "    truncation = True,\n",
    "    max_length=400,\n",
    ")\n",
    "\n",
    "for seq in sequences:\n",
    "    print(f\"Result: {seq['generated_text']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "### Downloading and converting weights to Hugging Face format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Request download of model weights from the Llama website\n",
    "Request download of model weights from the Llama website\n",
    "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
    "\n",
    "Fill  the required information, select the models \"Meta Llama 3\" and accept the terms & conditions. You will receive a URL in your email in a short time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Clone the llama repo and get the weights\n",
    "Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.\n",
    "\n",
    "This example demonstrates a Meta Llama 3.1 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Convert the model weights using Hugging Face transformer from source\n",
    "\n",
    "* `python3 -m venv hf-convertor`\n",
    "* `source hf-convertor/bin/activate`\n",
    "* `git clone https://github.com/huggingface/transformers.git`\n",
    "* `cd transformers`\n",
    "* `pip install -e .`\n",
    "* `pip install torch tiktoken blobfile accelerate`\n",
    "* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3.1`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "#### 4. Prepare the script\n",
    "Import the following necessary modules in your script: \n",
    "* `AutoModel` is the Llama 3 model class\n",
    "* `AutoTokenizer` prepares your prompt for the model to process\n",
    "* `pipeline` is an abstraction to generate model outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import transformers\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model_dir = \"${path_the_converted_hf_model}\"\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "        model_dir,\n",
    "        device_map=\"auto\",\n",
    "    )\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_dir)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`)  among various other options. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = transformers.pipeline(\n",
    "    \"text-generation\",\n",
    "    model=model,\n",
    "    tokenizer=tokenizer,\n",
    "    torch_dtype=torch.float16,\n",
    "    device_map=\"auto\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n",
    "\n",
    "By changing `max_length`, you can specify how long you’d like the generated response to be. \n",
    "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n",
    "\n",
    "In your script, add the following to provide input, and information on how to run the pipeline:\n",
    "\n",
    "\n",
    "#### 5. Run the example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sequences = pipeline(\n",
    "    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
    "    do_sample=True,\n",
    "    top_k=10,\n",
    "    num_return_sequences=1,\n",
    "    eos_token_id=tokenizer.eos_token_id,\n",
    "    max_length=400,\n",
    ")\n",
    "for seq in sequences:\n",
    "    print(f\"{seq['generated_text']}\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
